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Abstract 

We introduce a convex optimization modeling framework that transforms a convex 
optimization problem expressed in a form natural and convenient for the user into an 
equivalent cone program in a way that preserves fast linear transforms in the original 
problem. By representing linear functions in the transformation process not as matrices, 
but as graphs that encode composition of linear operators, we arrive at a matrix-free 
cone program, ie., one whose data matrix is represented by a linear operator and its 
adjoint. This cone program can then be solved by a matrix-free cone solver. By com¬ 
bining the matrix-free modeling framework and cone solver, we obtain a general method 
for efficiently solving convex optimization problems involving fast linear transforms. 
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1 Introduction 


Convex ootimization modeline systems like YALMIP |Lof04] . CVX |GB14]. CVXPY ra, 
and Convex.] 1 |UMZ'*~14] provide an automated framework for converting a convex optimiza¬ 
tion problem expressed in a natural human-readable form into the standard form required 
by a solver, calling the solver, and transforming the solution back to the human-readable 
form. This allows users to form and solve convex optimization problems quickly and ef¬ 
ficiently. These systems easily handle problems with a few thousand variables, as well as 
much larger problems (say, with hundreds of thousands of variables) with enough sparsity 
structure, which generic solvers can exploit. 

The overhead of the problem transformation, and the additional variables and constraints 
introduced in the transformation process, result in longer solve times than can be obtained 
with a custom algorithm tailored specifically for the particular problem. Perhaps surprisingly, 
the additional solve time (compared to a custom solver) for a modeling system coupled to a 
generic solver is often not as much as one might imagine, at least for modest sized problems. 
In many cases the convenience of easily expressing the problem makes up for the increased 
solve time using a convex optimization modeling system. 

Many convex optimization problems in applications like signal and image processing, or 
medical imaging, involve hundreds of thousands or many millions of variables, and so are 
well out of the range that current modeling systems can handle. There are two reasons for 
this. First, the standard form problem that would be created is too large to store on a 
single machine, and second, even if it could be stored, standard interior-point solvers would 
be too slow to solve it. Yet many of these problems are readily solved on a single machine 
by custom solvers, which exploit fast linear transforms in the problems. The key to these 
custom solvers is to directly use the fast transforms, never forming the associated matrix. 
For this reason these algorithms are sometimes referred to as matrix-free solvers. 

The literature on matrix-free solvers in signal and image processing is extensive; see, 
e.g., |BTn9al IBTOQb) ICENObl ICPlll 100091IZPBO?] . There has been particular interest in 
matrix-free solvers for LASSO and basis pursuit denoising problems |BT09bl ICDS98( lFGZ12( 
IFNW071 iKKL^Ofl lvdBF09] . Matrix-free solvers have also been developed for specialized 
control problems |VB93l IVB95] . The most general matrix-free solvers target semi definite 
programs |KS09] or quadratic programs and related problems |SKM'*~13( IGonl^ . The soft¬ 
ware closest to a convex optimization modeling system for matrix-free problems is TFOGS, 
which allows users to specify many types of convex problems and solve them using a variety 
of matrix-free first-order methods |BGG11| . 

To better understand the advantages of matrix-free solvers, consider the nonnegative 
deconvolution problem 

minimize ||c=t:x —6|p 
subject to a: > 0, 


( 1 ) 


where x E R” is the optimization variable, c G R” and b E are problem data, 

and * denotes convolution. Note that the problem data has size 0{n). There are many 
custom matrix-free methods for efficiently solving this problem, with 0{n) memory and a 
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few hundred iterations, each of which costs 0{n\ogn) floating point operations (flops). It 
is entirely practical to solve instances of this problem of size n = 10^ on a single computer 
[KBTTllLL^ . 

Existing convex optimization modeling systems fall far short of the efficiency of matrix- 
free solvers on problem ([1]). These modeling systems target a standard form in which a 
problem’s linear structure is represented as a sparse matrix. As a result, linear functions must 
be converted into explicit matrix multiplication. In particular, the operation of convolving 
by c will be represented as multiplication by a (2n — 1) x n Toeplitz matrix C. A modeling 
system will thus transform problem ([1]) into the problem 

minimize \\Cx — 6|p , . 

subject to a: > 0, ^ 

as part of the conversion into standard form. 

Once the transformation from ([I]) to ([2]) has taken place, there is no hope of solving 
the problem efficiently. The explicit matrix representation of C requires O(n^) memory. A 
typical interior-point method for solving the transformed problem will take a few tens of 
iterations, each requiring 0{'n?) flops. For this reason existing convex optimization modeling 
systems will struggle to solve instances of problem ([1]) with n = 10^, and when they are able 
to solve the problem, they will be dramatically slower than custom matrix-free methods. 

The key to matrix-free methods is to exploit fast algorithms for evaluating a linear func¬ 
tion and its adjoint. We call an implementation of a linear function that allows us to evaluate 
the function and its adjoint a forward-adjoint oracle (FAO). In this paper we describe a new 
algorithm for converting convex optimization problems into standard form while preserving 
fast linear functions. (A preliminary version of this paper appeared in |DB15] .l This yields a 
convex optimization modeling system that can take advantage of fast linear transforms, and 
can be used to solve large problems such as those arising in image and signal processing and 
other areas, with millions of variables. This allows users to rapidly prototype and implement 
new convex optimization based methods for large-scale problems. As with current modeling 
systems, the goal is not to attain (or beat) the performance of a custom solver tuned for the 
specihc problem; rather it is to make the specihcation of the problem straightforward, while 
increasing solve times only moderately. 

The outline of our paper is as follows. In §5] we give many examples of useful FAOs. In 
§2] we explain how to compose FAOs so that we can efficiently evaluate the composition and 
its adjoint. In §l]we describe cone programs, the standard intermediate-form representation 
of a convex problem, and solvers for cone programs. In ^ we describe our algorithm for 
converting convex optimization problems into equivalent cone programs while preserving fast 
linear transforms. In §2] we report numerical results for the nonnegative deconvolution prob¬ 
lem ([T]) and a special type of linear program, for our implementation of the abstract ideas 
in the paper, using versions of the existing cone solvers SCS |OCPB15] and BOGS |FB15| 
modihed to be matrix-free. (The details of these modihcations will be described elsewhere.) 
Even with our simple, far from optimized matrix-free cone solver, we demonstrate scaling to 
problems far larger than those that can be solved by generic methods (based on sparse ma- 
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trices), with acceptable performance loss compared to specialized custom algorithms tuned 
to the problems. 

We reserve certain details of our matrix-free canonicalization algorithm for the appendix. 
In ^ we explain the precise sense in which the cone program output by our algorithm is 
equivalent to the original convex optimization problem. In ^ we describe how existing 
modeling systems generate a sparse matrix representation of the cone program. The details 
of this process have never been published, and it is interesting to compare with our algorithm. 


2 Forward-adjoint oracles 

2.1 Definition 

A general linear function / : R” —>■ R”* can be represented on a computer as a dense matrix 
A G using 0{mn) bytes. We can evaluate /(x) on an input x G R” in 0{mn) flops 

by computing the matrix-vector multiplication Ax. We can likewise evaluate the adjoint 
f*i.y) — input y G R™ in 0{mn) flops by computing AAy. 

Many linear functions arising in applications have structure that allows the function and 
its adjoint to be evaluated in fewer than 0{mn) flops or using fewer than 0{mn) bytes of 
data. The algorithms and data structures used to evaluate such a function and its adjoint 
can differ wildly. It is thus useful to abstract away the details and view linear functions as 
forward-adjoint oracles (FAOs), i.e., a tuple T = (/,$/, $/*) where / is a linear function, 
is an algorithm for evaluating /, and $/* is an algorithm for evaluating /*. We use n to 
denote the size of /’s input and m to denote the size of /’s output. 

While we focus on linear functions from R” into R™, the same techniques can be used to 
handle linear functions involving complex arguments or values, i.e., from C” into C™, from 
R*" into C^, or from C"' into R”^, using the standard embedding of complex n-vectors into 
real 2U-vectors. This is useful for problems in which complex data arise naturally {e.g., in 
signal processing and communications), and also in some cases that involve only real data, 
where complex intermediate results appear (typically via an FFT). 

2.2 Vector mappings 

We present a variety of FAOs for functions that take as argument, and return, vectors. 

Scalar multiplication. Scalar multiplication by a G R is represented by the FAO F = 
(/, $/, $/*), where / : R"' —?■ R” is given by /(x) = ax. The adjoint f* is the same as /. 
The algorithms and simply scale the input, which requires 0{m-\-n) flops and 0(1) 
bytes of data to store a. Here m = n. 

Multiplication by a dense matrix. Multiplication by a dense matrix A G is 

represented by the FAO F = (/, <l>j, d>j*), where f(x) = Ax. The adjoint f*(u) = A'^u is 
also multiplication by a dense matrix. The algorithms and are the standard dense 
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matrix multiplication algorithm. Evaluating and requires 0{mn) flops and 0{mn) 
bytes of data to store A and A^. 


Multiplication by a sparse matrix. Multiplication by a sparse matrix A G 
he., a matrix with many zero entries, is represented by the FAO E = (/, d>j, d>j.), where 
f{x) = Ax. The adjoint f*{u) = A"^u is also multiplication by a sparse matrix. The 
algorithms and <h/» are the standard algorithm for multiplying by a sparse matrix in 
(for example) compressed sparse row format. Evaluating and $/» requires 0(nnz(A)) 
flops and 0(nnz(A)) bytes of data to store A and A^, where nnz is the number of nonzero 
elements in a sparse matrix |Dav061 Chap. 2]. 

Multiplication by a low-rank matrix. Multiplication by a matrix A G R™^"^ with rank 
k, where k m and A; -C n, is represented by the FAO F = (/, $/, $/*), where f{x) = Ax. 
The matrix A can be factored as A = BC, where B G R”^^^ and C G The adjoint 

f*{u) = is also multiplication by a rank k matrix. The algorithm evaluates f{x) 

by first evaluating z = Cx and then evaluating f{x) = Bz. Similarly, <hj-* multiplies by 
and then . The algorithms and require 0{k{m + n)) flops and use 0{k{m + n)) 
bytes of data to store B and C and their transposes. Multiplication by a low-rank matrix 
occurs in many applications, and it is often possible to approximate multiplication by a full 
rank matrix with multiplication by a low-rank one, using the singular value decomposition 
or methods such as sketching |Libl3] . 


Discrete Fourier transform. The discrete Fourier transform (DFT) is represented by 
the FAO r = (/, <hj-, where / : R^^ —)■ R^^ is given by 


f{x)k 

f{^}k+p 


Vp 

1 c> 

Vp 


(n. 


OJ' 


u-m-A 

p 


X 


A 


j 'O i OJp 




■p 


Xjj^p 

Xj^p 


ioT k = 1,... ,p. Here Up = e~’^'^'^lp. The adjoint f* is the inverse DFT. The algorithm 
is the fast Fourier transform (FFT), while is the inverse FFT. The algorithms can be 
evaluated in 0{{m+n) \og{m+n)) flops, using only 0(1) bytes of data to store the dimensions 
of /’s input and output |CT65( ILoa92] . Here m = n = 2p. There are many fast transforms 
derived from the DFT, such as the discrete Hartley transform |Bra84] and the discrete sine 
and cosine transforms |ANR741 IMar94] . with the same computational complexity as the 
FFT. 


Convolution. Convolution with a kernel c G R^ is defined as / ; R"' —>■ R”*, where 

f{x)k= ^ CiXj, k = l,...,m. (3) 

Different variants of convolution restrict the indices i,j to different ranges, or interpret vector 
elements outside their natural ranges as zero or using periodic (circular) indexing. 


6 
















Standard (column) convolution takes m = n + p — 1, and defines q and Xj in ([3]) as zero 
when the index is ouside its range. In this case the associated matrix Col(c) G jg 

Toeplitz, with each column a shifted version of c: 


Cl 

C2 


Col(c) 



Another standard form, row convolution, restricts the indices in ([3]) to the range k = 
p,... ,n. For simplicity we assume that n> p. In this case the associated matrix Row(c) G 

J^n-p+lxn 

is Toeplitz, with each row a shifted version of c, in reverse order; 


Row(c) = 


Cp ^p— 1 


Cl 


Cp Cp—i ... Cl 

The matrices Col(c) and Row(c) are related by the equalities 

Co^c)"^ = Ro-w(rev(c)), Row(c)'^ = Col(rev(c)), 


where rev(c)fc = Cp^k+i reverses the order of the entries of c. 

Yet another variant on convolution is circular convolution, where we take p = n and 
interpret the entries of vectors outside their range modulo n. In this case the associated 
matrix Circ(c) G R"-^"' is Toeplitz, with each column and row a (circularly) shifted version 
of c: 


Circ(c) 


Cl 

Cfi Cji—i 



C2 

C2 

Cl Cfi 




C3 

C2 







Cji 

Cn—1 



C2 

Cl 

Cn 

Cn 


C3 

C2 

Cl 


Column convolution with c G R^ is represented by the FAO F = where 

/ : R" —)■ is given by f{x) = Col{c)x. The adjoint f* is row convolution with 

rev(c), he., f*{u) = Row(rev(c))M. The algorithms and <!>/* are given in algorithms [1] 
and |2l and require 0{{m + n + p) log(m + n + p)) flops. Here m = n + p — 1. If the kernel is 
small (he., p <C n), d*/ and instead evaluate ([3]) directly in 0{np) flops. In either case, 
the algorithms and use 0{p) bytes of data to store c and rev(c) |CLW691 |Lpa92]. 
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Algorithm 1 Column convolution c* x. 

Input: c G is a length p array, x G is a length n array, y G is a length 

n + p — 1 array. 

Extend c and a; into length n + p — 1 arrays by appending zeros, 
c EFT of c. 

X EFT of X. 

for i = l,...,n + p— Ido 

Vi ^ CjXj. 

y inverse EFT of y. 

Postcondition: y = c* x. 


Algorithm 2 Row convolution c*u. 

Input: c G R^ is a length p array, u G is a length n + p — 1 array, v G R” is a 

length n array. 

Extend rev(c) and v into length n + p — 1 arrays by appending zeros. 

c ■(— inverse EFT of zero-padded rev (c). 

u EFT of u. 

for i = l,...,n-|-p— Ido 

Vi CiUi. 

V inverse EFT of v. 

Reduce u to a length n array by removing the last p — 1 entries. 

Postcondition: v = c* u. 


Algorithm 3 Circular convolution c* x. 

Input: c G R” is a length n array, x G R"" is a length n array, y G R” is a length n array. 

c ■«— EFT of c. 

X EFT of X. 

for i = 1,..., n do 

Pi t CiXi- 

y inverse EFT of y. 

Postcondition: y = c* x. 









Circular convolution with c G R"' is represented by the FAO F = (/, $j*), where 

/ : R" —>■ R"" is given by f{x) = Circ(c)x. The adjoint f* is circular convolution with 


c = 


Cl 

Cn 

Cn—1 


C2 


The algorithms and <F/* are given in algorithm |3l and require 0{{m + n) \og{m + n)) flops. 
The algorithms and <F/* use 0{m + n) bytes of data to store c and c |CLW691 ILoa92] . 
Here m = n. 


Discrete wavelet transform. The discrete wavelet transform (DWT) for orthogonal 
wavelets is represented by the FAO F = (/, <Fj, where the function / ; R^^ —)■ R^^ 
is given by 


fix) 


DiG, 

I 2 P -2 


D p_\G p—\ 

Dp —1 Hp— I 


J2P-1 



DpGp 


DpHp 


(4) 


where Dk G R^* is defined 
Hk G are given by 


such that {Dkx)i = X 2 i and the matrices Gk G R 


2^x2'“ 


and 


Gk = 






Here g & TD and h G R'^ are low and high pass Alters, respectively, that parameterize the 
DWT. The adjoint f* is the inverse DWT. The algorithms and repeatedly convolve by 
g and h, which requires 0{q{m + n)) flops and uses 0{q) bytes to store h and g |Mal89] . Here 
m = n = 2'P. Common orthogonal wavelets include the Haar wavelet and the Daubechies 
wavelets |Dau881 IDau92] . There are many variants on the particular DWT described here. 
For instance, the product in (j3]) can be terminated after fewer than p — 1 multiplications by 
Gk and Hk |J1CH01] . Gk and Hk can be defined as a different type of convolution matrix, or 
the Alters g and h can be different lengths, as in biorthogonal wavelets |CDF92) . 


Discrete Gauss transform. The discrete Gauss transform (DGT) is represented by the 
FAO F = <F/, <F/.), where the function fY,z,h ■ R"" —t R™' is parameterized by H G 

Z G and h > 0. The function fY,z,h is given by 

n 

fY,z,hix)i = X]exp(-||i/i - Zjf/h‘^)xj, i = 1,... ,m, 

i=i 
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where yi G is the ith column of Y and Zj G R'^ is the jth column of Z. The adjoint of fY,z,h 
is the DGT fz,Y,h- The algorithms and $/■* are the improved fast Gauss transform, which 
evaluates f{x) and f*{u) to a given accuracy in Oi^dPim + n)) flops. Here p is a parameter 
that depends on the accuracy desired. The algorithms and $/» use 0{d{m + n)) bytes 
of data to store Y, Z, and h |YDGDd3] . An interesting application of the DGT is efficient 
multiplication by a Gaussian kernel |YDD05| . 

Multiplication by the inverse of a sparse triangular matrix. Multiplication by the 
inverse of a sparse lower triangular matrix L G R"^” with nonzero elements on its diagonal is 
represented by the FAO T = (/, where /(x) = L~^x. The adjoint f*{u) = 

is multiplication by the inverse of a sparse upper triangular matrix. The algorithms $f and 
are forward and backward substitution, respectively, which require 0(nnz(L)) flops and 
use 0(nnz(L)) bytes of data to store L and |Dav061 Ghap. 3]. 

Multiplication by a pseudo-random matrix. Multiplication by a matrix A G 
whose columns are given by a pseudo-random sequence (he., the first m values of the sequence 
are the first column of A, the next m values are the second column of A, etc.) is represented 
by the FAO F = (/, <Fj, *h/*), where f{x) = Ax. The adjoint f*{u) = A^u is multiplication 
by a matrix whose rows are given by a pseudo-random sequence (i. e., the first m values of 
the sequence are the first row of A^, the next m values are the second row of A^, etc.). The 
algorithms and $/* are the standard dense matrix multiplication algorithm, iterating once 

over the pseudo-random sequence without storing any of its values. The algorithms require 
0{mn) flops and use 0(1) bytes of data to store the the seed for the pseudo-random sequence. 
Multiplication by a pseudo-random matrix might appear, for example, as a measurement 
ensemble in compressed sensing |GSTVn7] . 

Multiplication by the pseudo-inverse of a graph Laplacian. Multiplication by the 
pseudo-inverse of a graph Laplacian matrix L G is represented by the FAO F = 

(/, $/, $/*), where /(x) = L^x. A graph Laplacian is a symmetric matrix with nonpositive 
off diagonal entries and the property LI = 0, he., the diagonal entry in a row is the negative 
sum of the off-diagonal entries in that row. (This implies that it is positive semi definite.) 
The adjoint f* is the same as /, since L = LA. The algorithms and $/* are one of the fast 
solvers for graph Laplacian systems that evaluate /(x) = /*(x) to a given accuracy in around 
0(nnz(L)) flops |ST041 IKOSZ13( IVisl2] . (The details of the computational complexity are 
much more involved.) The algorithms use 0(nnz(L)) bytes of data to store L. 

2.3 Matrix mappings 

We now consider linear functions that take as argument, or return, matrices. We take the 
standard inner product on matrices X, Y G 

(.Y,y)= Y. = Tr(.Y^K). 
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The adjoint of a linear fnnction / : —)■ is then the fnnction f* : R®^* R^^'^ for 

which 

Tr(/(XfF) = Tr(X^r(y)), 
holds for all X e R^’^" and Y e R®^*. 

Vec and mat. The fnnction vec : R^^”^ —)■ R^”^ is represented by the FAO T = (/, 
where f{X) converts the matrix X G R^^'^ into a vector y G R^'^ by stacking the columns. 
The adjoint f* is the function mat : R^”^ —)■ which outputs a matrix whose columns 

are successive slices of its vector argument. The algorithms and <h/» simply reinterpret 
their input as a differently shaped output in 0(1) flops, using only 0(1) bytes of data to 
store the dimensions of /’s input and output. 

Sparse matrix mappings. Many common linear functions on and to matrices are given 
by a sparse matrix multiplication of the vectorized argument, reshaped as the output matrix. 
For X G R^^'^ and /(X) = X G R®^*, 

Y = mat(A vec(X)). 

The form above describes the general linear mapping from R^^^ to R®^*; we are interested 
in cases when A is sparse, he., has far fewer than pqst nonzero entries. Examples include 
extracting a submatrix, extracting the diagonal, forming a diagonal matrix, summing the 
rows or columns of a matrix, transposing a matrix, scaling its rows or columns, and so on. 
The FAO representation of each such function is F = (/, <Fj, <Fj»), where / is given above 
and the adjoint is given by 

f*{U) = mat(A'^ vec(f/)). 

The algorithms $f and $ f* are the standard algorithms for multiplying a vector by a sparse 
matrix in (for example) compressed sparse row format. The algorithms require 0(nnz(A)) 
flops and use 0(nnz(A)) bytes of data to store A and A^ |Dav06( Chap. 2]. 

Matrix product. Multiplication on the left by a matrix A G R®^^ and on the right by 
a matrix B G R'^^* is represented by the FAO F = (/,$/,$/*), where / : R^^"^ —)■ R®^* 
is given by /(X) = AXB. The adjoint f*{U) = A"^UB"^ is also a matrix product. There 
are two ways to implement efficiently, corresponding to different orders of operations in 
multiplying out AXB. In one method we multiply by A first and B second, for a total of 
0{s{pq + qt)) flops (assuming that A and B are dense). In the other method we multiply by 
B first and A second, for a total of 0{p{qt + st)) flops. The former method is more efficient 



Similarly, there are two ways to implement $/», one requiring 0{s{pq + qt)) flops and the 
other requiring 0{p{qt + st)) flops. The algorithms and <!>/* use 0{sp + qt) bytes of data 
to store A and B and their transposes. When p = q = s = t, the flop count for and $/» 
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simplifies to O ((m + n)^'®) flops. Here m = n = pq. (When the matrices A oi B are sparse, 
evaluating f{X) and f*{U) can be done even more efficiently.) The matrix product function 
is used in Lyapunov and algebraic Riccati inequalities and Sylvester equations, which appear 
in many problems from control theory |GLAM9^ IVB95] . 


2-D discrete Fourier transform. The 2-D DFT is represented by the FAO F = (/,$/,$/*), 
where / : ^ is given by 


fiXhi 


1 

1 

yM X,s=i 


Et=i ^ 


V _ C> A v 

-^st 1 l^p j -^s+p,t 


for fc = 1,... ,p and i = 1,... ,q. Here Up = and Uq = The adjoint f* is 

the inverse 2-D DFT. The algorithm $/ evaluates f{X) by first applying the FFT to each 
row of X, replacing the row with its DFT, and then applying the FFT to each column, 
replacing the column with its DFT. The algorithm <Fj* is analogous, but with the inverse 
FFT and inverse DFT taking the role of the FFT and DFT. The algorithms <Fj and <F/* 
require 0{{m + n) log(m -|- n)) flops, using only 0(1) bytes of data to store the dimensions 
of /’s input and output |Lim901 ILoa92] . Here m = n = 2pq. 


2-D convolution. 2-D convolution with a kernel C G R^^*^ is dehned as / : R^^* 
Rmixma^ where 


fiX)ki= ^ /c = l,...,mi, £=l,...,m 2 . (5) 

il+i2=k+l,ji+j2=i+l 


Different variants of 2-D convolution restrict the indices A, ji and f 2 ;j 2 fo different ranges, 
or interpret matrix elements outside their natural ranges as zero or using periodic (circular) 
indexing. There are 2-D analogues of 1-D column, row, and circular convolution. 

Standard 2-D (column) convolution, the analogue of 1-D column convolution, takes mi = 
s + p — 1 and m 2 = t + q — 1, and defines and Xi^j^ in ([5]) as zero when the indices are 
outside their range. We can represent the 2-D column convolution H = G * X as the matrix 
multiplication 

Y = mat(Col(G) vec(X)), 
where Col(G) G R(^+P-i)(t+?-i)x.t jg 


Col(G) 


■ Col(ci) 


Col(c 2 ) • 


Col(c,) 

. Col(ci) 

Col(c 2 ) 


Col(c,) 
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are 1-D 


Here Ci,..., G are the columns of C and Col(ci),..., Col(cq) G 
column convolution matrices. 

The 2-D analogue of 1-D row convolution restricts the indices in ([S]) to the range k = 
p,... ,s and i = q,... ,t. For simplicity we assume s > p and t > q. The output dimensions 
are mi = s — p -|- 1 and m 2 = t — q + 1. We can represent the 2-D row convolution Y = C * X 
as the matrix multiplication 


Y = mat(Row(C) vec(X)), 
where Row(C') G R(*-P+i)h-9+i)xst jg ^y. 


Row(C) 


Row(Cg) Row(Cq_i) ... Row(ci) 


Row(Cq) Row(Cg_i) . . . Row(ci) 


Here Row(ci Row(cq) G R"^ p+ixs 1-D row convolution matrices. The matrices 
Col(C) and Row(C) are related by the equalities 

Col(C)'^ = Row(rev(C)), Row(C')'^ = Col(rev(C)), 

where rev(C)fc£ = Cp-k+i,q-i+i reverses the order of of the columns of C and of the entries 
in each row. 

In the 2-D analogue of 1-D circular convolution, we take p = s and q = t and interpret 
the entries of matrices outside their range modulo s for the row index and modulo t for the 
column index. We can represent the 2-D circular convolution Y = C * X as the matrix 
multiplication 

Y = mat(Circ(C') vec(X)), 


where Circ(C') G is given by: 






Circ(ci) Circ(Q) 

Circ(ct_i) 



Circ(c2) 


Circ(c 2 ) Circ(ci) 

Circ(Q) 




Circ(C') = 

Circ(c3) Circ(c2) 



Circ(Q) 

Circ(ct_i) 




Circ(c2) 

Circ(ci) 

Circ(Q) 


_ Circ(Q) 


Circ(c3) 

Circ(c2) 

Circ(ci) 

Here Circ(ci),.. 

., Circ(ct) G R^^^ are 

1-D circular convolution matrices. 


2-D column convolution with C G R^^*^ is represented by the FAO F = (/, <F/, 
where / ; R^ Xt ^ ^s+p-lXt+q-l jg gj^g^ ^y 


f{X) = mat(Col(C) vec(X)). 
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The adjoint f* is 2-D row convolution with rev(C'), he., 

f*{U) = mat(Row(rev(C')) vec(17)). 

The algorithms and are given in algorithms 0] and O and require 0{{m+n) log(m+n)) 
flops. Here m = (s+p —l)(t + g —1) and n = st. If the kernel is small (he., p s and q <C t), 
and instead evaluate ([S]) directly in 0{pqst) flops. In either case, the algorithms 
and $/* use 0{pq) bytes of data to store C and rev(C) |Loa921 Chap. 4]. Often the kernel 
is parameterized {e.g., a Gaussian kernel), in which case more compact representations of C 
and rev(C') are possible |FP021 Chap. 7]. 


Algorithm 4 2-D column convolution C * X. 

Input: C G is a length pq array. X G R*^* is a length st array. Y G jg 

a length (s -|- p — l)(t -1- g — 1) array. 

Extend the columns and rows of C and X with zeros so G, X G 
C ^ 2-D DFT of C. 

X ^ 2-D DFT of X. 
for z = l,...,s-|-p— Ido 
for j = 1,..., f -f- g — 1 do 

Yij CijXij. 

Y ■(— inverse 2-D DFT of Y. 

Postcondition: Y = C * X. 


Algorithm 5 2-D row convolution C *U. 

Input: C G R^^*^ is a length pq array. U G jg a length (s -|- p — l)(t -|- g — 1) 

array. H G R®^* is a length st array. 

Extend the columns and rows of rev(G) and V with zeros so rev(G), V G 
C <(— inverse 2-D DFT of zero-padded rev(G). 

U ^ 2-D DFT of U. 
for i = l,...,s-|-p— Ido 
for j = 1,..., f -I- g — 1 do 

Vjj A ^ij * 

V ■(— inverse 2-D DFT of V. 

Truncate the rows and columns of V so that V G R*^^ 

Postcondition: V = C * U. 
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Algorithm 6 2-D circular convolution C * X. 

Input: (7 G is a length st array. X E is a length st array. Y G R"^^* is a length 
st array. 

C E- 2-D DFT of C. 

X E- 2-D DFT of X. 

for z = 1,..., s do 
for j = 1,..., f do 

Yj^j i CijXij. 

Y E- inverse 2-D DFT of Y. 

Postcondition: Y = C * X. 


2-D circular convolution with C G R'^^* is represented by the FAO F = (/,$/,$/*), 
where f : ^ IV^Hs given by 

/(X) = mat(Circ(C) vec(X)). 

The adjoint f* is 2-D circular convolution with 



Ci,i 

Ci,t 

Oi,t_i 

Oi,2 


Cs,l 

Cs,t 

Cs,t-1 

. . . 0,,2 

c = 

Cs-l,l 

Cs-l,t 

Cs-l,t-l 

• • • Os-1,2 


02,1 

C2,t 

02,t-l 

02,2 


The algorithms and are given in algorithm [HI and require 0{{m -|- n) log(m -|- n)) 
flops. The algorithms and <F/* use 0{m + n) bytes of data to store C and C |Loa92l 
Chap. 4]. Here m = n = st. 


2-D discrete wavelet transform. The 2-D DWT for separable, orthogonal wavelets is 
represented by the FAO F = (/, $/«), where / : r2^x 2’’ jg given by 

/(X),, = IT, • • • lF,_ilT,XH/jHf_i • • • IFj, 


where k = max{ |'log 2 (z)], [log 2 (j)], 1} and hF, ^ r 2 ’'x 2 p given by 


Wu 


DkGk 

DkHk 

I 


Here D,, G,, and R, are dehned as for the 1-D DWT. The adjoint f* is the inverse 2-D 
DWT. As in the 1-D DWT, the algorithms and ^f* repeatedly convolve by the hlters 
g eB!^ and h G R”^, which requires 0{q{m + n)) flops and uses 0(g) bytes of data to store g 
and h |J1CH01] . Here m = n = 2^. There are many alternative wavelet transforms for 2-D 
data; see, e.g., ICDDYObi [SCPn^ [DVn^ IJDCPll] . 
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2.4 Multiple vector mappings 

In this section we consider linear functions that take as argument, or return, multiple vectors. 
(The idea is readily extended to the case when the arguments or return values are matrices.) 
The adjoint is dehned by the inner product 

k k 

{{xu...,Xk),{yu...,yk)) = '^{xi,yi) = '^xfyi. 

i=l i=l 

The adjoint of a linear function / : x • • • x —)■ R'"^ x • • • x R™'^ is then the function 

y* ; Rmi X ... X R"^^ ^ R"i X • • • X R^'' for which 

e. k 

'^f{xi,...,Xk)^yi = ^xj f*{yi,...,yi)i, 
i=l i=l 

holds for all (xi,..., Xk) G R"^ x • • ■ x R"'' and (|/i, ■ ■ ■ ,ye) E R™^ x • • • x R™'h Here 
/(xi,..., Xk)i and f*{yi ,..., yi)i refer to the ith output of / and /*, respectively. 

Sum and copy. The function sum : R™ x • • • x R'" —)■ R”^ with k inputs is represented 
by the FAO T = (/,$/,<!)/*), where /(xi,... ^Xk) = xi + • • • + Xfc. The adjoint f* is the 
function copy ; R™ R”^ x • • • x R'”, which outputs k copies of its input. The algorithms 
and <h/* require 0{m + n) flops to sum and copy their input, respectively, using only 
0(1) bytes of data to store the dimensions of /’s input and output. Here n = km. 

Vstack and split. The function vstack : R'"^ x • • • x R™'' —)■ R'^ is represented by the 
FAO F = (/, <l>j, <l>j»), where /(xi,...,Xfc) concatenates its k inputs into a single vector 
output. The adjoint f* is the function split : R” —)■ R'"^ x ••• x R'"'', which divides a 
single vector into k separate components. The algorithms and <!>/» simply reinterpret 
their input as a differently sized output in 0(1) flops, using only 0(1) bytes of data to store 
the dimensions of /’s input and output. Here n = m = mi + • • • + mk. 

2.5 Additional examples 

The literature on fast linear transforms goes far beyond the preceding examples. In this sec¬ 
tion we highlight a few notable omissions. Many methods have been developed for matrices 
derived from physical systems. The multigrid |Hac85) and algebraic multigrid |BMR85| meth¬ 
ods efficiently apply the inverse of a matrix representing discretized partial differential equa¬ 
tions (PDFs). The fast multipole method accelerates multiplication by matrices representing 
pairwise interactions |GR871 ICGR88] . much like the fast Gauss transform |GS91] . Heirarchi- 
cal matrices are a matrix format that allows fast multiplication by the matrix and its inverse, 
with applications to discretized integral operators and PDFs |Hac991 IHKSOOl IBGH03] . 

Many approaches exist for factoring an invertible sparse matrix into a product of com¬ 
ponents whose inverses can be applied efficiently, yielding a fast method for applying the 
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inverse of the matrix |DER86l IDavOG] . A sparse LU factorization, for instance, decomposes 
an invertible sparse matrix A E into the product A = LU of a lower triangular matrix 

L E and an upper triangular matrix U E R"^". The relationship between nnz(A), 

nnz(L), and nnz(f/) is complex and depends on the factorization algorithm |Dav061 Chap. 6]. 

We only discussed 1-D and 2-D DFTs and convolutions, but these and related transforms 
can be extended to arbitrarily many dimensions |DM841 ILoa92] . Similarly, many wavelet 
transforms naturally operate on data indexed by more than two dimensions |KV921 lYDCOSt 

fLDOT] . 

3 Compositions 

In this section we consider compositions of FAOs. In fact we have already discussed sev¬ 
eral linear functions that are naturally and efficiently represented as compositions, such as 
multiplication by a low-rank matrix and sparse matrix mappings. Here though we present 
a data structure and algorithm for efficiently evaluating any composition and its adjoint, 
which gives us an FAO representing the composition. 

A composition of FAOs can be represented using a directed acyclic graph (DAG) with 
exactly one node with no incoming edges (the start node) and exactly one node with no 
outgoing edges (the end node). We call such a representation an FAO DAG. 

Fach node in the FAO DAG stores the following attributes: 

• An FAO r = (/, <I)j, $j-*). Concretely, / is a symbol identifying the function, and 
and d)/. are executable code. 

• The data needed to evaluate <h/ and d*/*. 

• A list Fin of incoming edges. 

• A list Font of outgoing edges. 

Fach edge has an associated array. The incoming edges to a node store the arguments to the 
node’s FAO. When the FAO is evaluated, it writes the result to the node’s outgoing edges. 
Matrix arguments and outputs are stored in column-major order on the edge arrays. 

As an example, £gure[T]shows the FAO DAG for the composition f{x) = Ax-EBx, where 
A E R™'^" and B E R™'^" are dense matrices. The copy node duplicates the input x E R"" 
into the multi-argument output (x, x) E R"^ x R". The A and B nodes multiply by A and B, 
respectively. The sum node sums two vectors together. The copy node is the start node, 
and the sum node is the end node. The FAO DAG requires 0{mn) bytes to store, since 
the A and B nodes store the matrices A and B and their tranposes. The edge arrays also 
require 0{mn) bytes of memory. 

3.1 Forward evaluation 

To evaluate the composition f{x) = Ax -f Bx using the FAO DAG in hgure [H we hrst 
evaluate the start node on the input x E R"", which copies x onto both outgoing edges. 
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Figure 1: The FAO DAG for f{x) = Ax + Bx. 


We evaluate the A and B nodes (serially or in parallel) on their incoming edges, and write 
the results [Ax and Bx) to their outgoing edges. Finally, we evaluate the end node on its 
incoming edges to obtain the result Ax + Bx. 

The general procedure for evaluating an FAO DAG is given in algorithm [71 The algorithm 
evaluates the nodes in a topological order. The total flop count is the sum of the flops from 
evaluating the algorithm on each node. If we allocate all scratch space needed by the 
FAO algorithms in advance, then no memory is allocated during the algorithm. 


Algorithm 7 Evaluate an FAO DAG. 

Input: G = (V, E) is an FAO DAG representing a function /. G is a list of nodes. E is a 
list of edges. / is a list of inputs to /. O is a list of outputs from /. Each element of I 
and O is represented as an array. 

Greate edges whose arrays are the elements of I and save them as the list of incoming 
edges for the start node. 

Greate edges whose arrays are the elements of O and save them as the list of outgoing 
edges for the end node. 

Greate an empty queue Q for nodes that are ready to evaluate. 

Greate an empty set S for nodes that have been evaluated. 

Add G’s start node to Q. 
while Q is not empty do 

u pop the front node of Q. 

Evaluate u’s algorithm on u’s incoming edges, writing the result to u’s outgoing 
edges. 

Add u to S. 

for each edge e = (u, v) in n’s Eout do 

if for all edges (p, v) in n’s Ejn, p is in S' then 
Add V to the end of Q. 

Postcondition: O contains the outputs of / applied to inputs /. 
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Figure 2: The FAO DAG for f*{u) = A^u + obtained by transforming the 
FAO DAG in hgure [TJ 


3.2 Adjoint evaluation 

Given an FAO DAG G representing a function /, we can easily generate an FAO DAG G* 
representing the adjoint f*. We modify each node in G, replacing the node’s FAO (/,$/,$/*) 
with the FAO (/*,$/*,$/) and swapping F^in and -Fout- We also reverse the orientation of 
each edge in G. We can apply algorithm [7] to the resulting graph G* to evaluate f*. Figure 
|2] shows the FAO DAG in hgure [U transformed into an FAO DAG for the adjoint. 

3.3 Parallelism 

Algorithm [7] can be easily parallelized, since the nodes in the ready queue Q can be evaluated 
in any order. A simple parallel implementation could use a thread pool with t threads to 
evaluate up to t nodes in the ready queue at a time. The extent to which parallelism speeds 
up evaluation of the composition graph depends on how many parallel paths there are in 
the graph, he., paths with no shared nodes. The evaluation of individual nodes can also be 
parallelized by replacing a node’s algorithm with a parallel variant. For example, the 
standard algorithms for dense and sparse matrix multiplication can be trivially parallelized. 

3.4 Optimizing the DAG 

The FAO DAG can often be transformed so that the output of algorithm [7| is the same but 
the algorithm is executed more efficiently. Such optimizations are especially important when 
the FAO DAG will be evaluated on many different inputs (as will be the case for matrix-free 
solvers, to be discussed later). For example, the FAO DAG representing f{x) = ABx + AGx 
where A,B,G G shown in hgure |3l can be transformed into the FAO DAG in hgure HI 

which requires one fewer multiplication by A. The transformation is equivalent to rewriting 
f{x) = ABx + AGx as f{x) = A{Bx -|- Gx). Many other useful graph transformations can 
be derived from the rewriting rules used in program analysis and code generation |ALSU06] . 

Sometimes graph transformations will involve pre-computation. For example, if two 
nodes representing the composition f{x) = b'^cx, where 6, c G R”, appear in an FAO DAG, 
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Figure 4: The FAO DAG for f{x) = A{Bx + Cx). 
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the DAG can be made more efficient by evalnating a = b^c and replacing the two nodes 
with a single node for scalar mnltiplication by a. 

The optimal rewriting of a DAG will depend on the hardware and overall architectnre 
on which the mnltiplication algorithm is being rnn. For example, if the algorithm is being 
rnn on a distribnted compnting cluster then a node representing multiplication by a large 
matrix 


All Ai2 
A21 A22 ’ 


could be split into separate nodes for each block, with the nodes stored on different com¬ 
puters. This rewriting would be necessary if the matrix A is so large it cannot be stored 
on a single machine. The literature on optimizing compilers suggests many approaches to 
optimizing an FAO DAG for evaluation on a particular architecture |ALSU06] . 


3.5 Reducing the memory footprint 

In a naive implementation, the total bytes needed to represent an FAO DAG G, with node 
set V and edge set E, is the sum of the bytes of data on each node u E V and the bytes 
of memory needed for the array on each edge e E E. A more sophisticated approach can 
substantially reduce the memory needed. For example, when the same FAO occurs more 
than once in V, duplicate nodes can share data. 

We can also reuse memory across edge arrays. The key is determining which arrays can 
never be in use at the same time during algorithm [71 An array for an edge {u,v) is in use if 
node u has been evaluated but node v has not been evaluated. The arrays for edges (Mi,ni) 
and {u 2 , V 2 ) can never be in use at the same time if and only if there is a directed path from 
Vi to U 2 or from V 2 to Ui. If the sequence in which the nodes will be evaluated is fixed, rather 
than following an unknown topological ordering, then we can say precisely which arrays will 
be in use at the same time. 

The next step is to map the edge arrays onto a global array, keeping the global array as 
small as possible. Let L[e) denote the length of edge e’s array and U E x E denote the 
set of pairs of edges whose arrays may be in use at the same time. Formally, we want to 
solve the optimization problem 

minimize ma.Xe^E{ze + L{e)} 

subject to [ze,Ze + L{e)-l]n[zf,Zf + L{f)-l] = ^, {e, f) E U (6) 

e {1,2,...}, eEE, 

where the Ze are the optimization variables and represent the index in the global array where 
edge e’s array begins. 

When all the edge arrays are the same length, problem ([6]) is equivalent to finding the 
chromatic number of the graph with vertices E and edges U. Problem (jS]) is thus NP-hard 
in general |Kar72] . A reasonable heuristic for problem (jS]) is to first find a graph coloring of 
{E, U) using one of the many efficient algorithms for finding graph colorings that use a small 
number of colors; see, e.g., |Hal931 IB re 79]. We then have a mapping 0 from colors to sets of 
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edges assigned to the same color. We order the colors arbitrarily as Ci,..., and assign the 
Zf. as follows: 


1, e e 0(ci) 

max/6^(c,_i){z/ + L(/)}, e e 0(ci), i > 1. 


Additional optimizations can be made based on the nniqne characteristics of different 
FAOs. For example, the ontgoing edges from a copy node can share the incoming edge’s 
array nntil the ontgoing edges’ arrays are written to (he., copy-on-write). Another example 
is that the ontgoing edges from a split node can point to segments of the array on the 
incoming edge. Similarly, the incoming edges on a vstack node can point to segments of 
the array on the ontgoing edge. 


3.6 Software implementations 

Several software packages have been developed for constrncting and evalnating composi¬ 
tions of linear fnnctions. The MATLAB toolbox SPOT allows nsers to constrnct expressions 
involving both fast transforms, like convolntion and the DFT, and standard matrix mnlti- 
plication |HHS~*'14| . TFOCS, a framework in MATLAB for solving convex problems nsing 
a variety of first-order algorithms, provides fnnctionality for constrncting and composing 
FAOs |BCG11) . The Python package linop provides methods for constrncting FAOs and 
combining them into linear expressions |Vail3] . Halide is a domain specific langnage for 
image processing that makes it easy to optimize compositions of fast transforms for a variety 
of architectnres |RKBA+13] . 


4 Cone programs and solvers 

4.1 Cone programs 

A cone program is a convex optimization problem of the form 

minimize , 

snbject to Ax -1- 6 G /C, ^ 

where x G R"" is the optimization variable, /C is a convex cone, and A G c G R", and 

6 G R™ are problem data. Cone programs are a broad class that inclnde linear programs, 
second-order cone programs, and semidefinite programs as special cases |NN921 [BV04] . We 
call the cone program matrix-free if A is represented implicitly as an FAO, rather than 
explicitly as a dense or sparse matrix. 

The convex cone /C is typically a Cartesian prodnct of simple convex cones from the 
following list: 

• Zero cone: /Cq = {0}. 

• Free cone: /Cfree = R- 
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Nonnegative cone: /C+ = {xGR|x>0}. 

Second-order cone: JCsoc = G | x G R"', t G R, ||x ||2 < t}. 

Positive seniidefinite cone: /Cpsd = {vec(X) | X G S"', z^Xz > 0 for all z G R”}. 
Exponential cone ( [PB14| §6.3.4]): 

/Cexp = {(x, ?/, z) G R^ I 2/ > 0, < z} U {(x, ?/, z) G R^ | x < 0, ?/ = 0, z > 0}. 


• Power cone r |Nesn6l[SYl4llKhal4] P 

G R^" I xV“^ > kl, a; > 0, 2/ > 0}, 

where a G [0,1]. 

These cones are useful in expressing common problems (via canonicalization), and can be 
handled by various solvers (as discussed below). Note that all the cones are subsets of R"', 
he., real vectors. It might be more natural to view the elements of a cone as matrices 
or tuples, but viewing the elements as vectors simplifies the matrix-free canonicalization 
algorithm in ^ 

Cone programs that include only cones from certain subsets of the list above have special 
names. For example, if the only cones are zero, free, and nonnegative cones, the cone program 
is a linear program; if in addition it includes the second-order cone, it is called a second- 
order cone program. A well studied special case is so-called symmetric cone programs, 
which include the zero, free, nonnegative, second-order, and positive semidefinite cones. 
Semidefinite programs, where the cone constraint consists of a single positive semidefinite 
cone, are another common case. 


4.2 Cone solvers 


Many methods have been developed to solve cone programs, the most widely used being 
interior-point methods; see, e.g.^ |BV04( |NN94( |NW06( IWri^ lYell] . 


Interior-point. A large number of interior-point cone solvers have been implemented. 
Most support symmetric cone programs. SDPT3 |TTT99] and SeDuMi |Stu99) are open- 
source solvers implemented in MATLAB; CVXOPT | ADVI 5] is an open-source solver imple¬ 
mented in Python; MOSEK |mosl5) is a commercial solver with interfaces to many languages. 
EGOS is an open-source cone solver written in library-free C that supports second-order cone 
programs |DCB13| : Akle extended EGOS to support the exponential cone |Akll5] . DSDP5 
|BY05] and SDPA |FFK'*~08] are open-source solvers for semidefinite programs implemented 
in G and G++, respectively. 
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First-order. First-order methods are an alternative to interior-point methods that scale 
more easily to large cone programs, at the cost of lower accnracy. PDOS |COPB13] is a 
hrst-order cone solver based on the alternating direction method of mnltipliers (ADMM) 
IBPC+Il] , PDOS snpports second-order cone programs. POGS |FB15] is an ADMM based 
solver that rnns on a GPU, with a version that is similar to PDOS and targets second-order 
cone programs. SGS is another ADMM-based cone solver, which snpports symmetric cone 
programs as well as the exponential and power cones |OGPB15] . Many other hrst-order 
algorithms can be applied to cone programs {e.g., |LLMlll IGPllt IPGll] ). bnt none have 
been implemented as a robnst, general pnrpose cone solver. 

Matrix-free. Matrix-free cone solvers are an area of active research, and a small nnmber 
have been developed. PENNON is a matrix-free semidehnite program (SDP) solver |KS09| . 
PENNON solves a series of nnconstrained optimization problems nsing Newton’s method. 
The Newton step is compnted nsing a preconditioned conjngate gradient method, rather 
than by factoring the Hessian directly. Many other matrix-free algorithms for solving SDPs 
have been proposed {e.g., |GY00l IFKS02[ IToh04[ IZSTlOj h 

Several matrix-free solvers have been developed for qnadratic programs (QPs), which 
are a snperset of linear programs and a snbset of second-order cone programs. Gondzio 
developed a matrix-free interior-point method for QPs that solves linear systems using a 
preconditioned conjugate gradient method |Gonl2b[ IGonl2at IHS52] . PDGO is a matrix- 
free interior-point solver that can solve QPs |SKM^13] . using LSMR to solve linear systems 

[Fm] . 

5 Matrix-free canonicalization 

5.1 Canonicalization 

Ganonicalization is an algorithm that takes as input a data structure representing a general 
convex optimization problem and outputs a data structure representing an equivalent cone 
program. By solving the cone program, we recover the solution to the original optimization 
problem. This approach is used by convex optimization modeling systems such as YALMIP 
|T.ofn4]. GVX |GB14]. GVXPY ina, and Gonvex.jl |UMZ'*~14] . The same technique is used 
in the code generators GVXGEN |MB12] and QGML |GPDB13] . 

The downside of canonicalization’s generality is that special structure in the original 
problem may be lost during the transformation into a cone program. In particular, current 
methods of canonicalization convert fast linear transforms in the original problem into mul¬ 
tiplication by a dense or sparse matrix, which makes the hnal cone program far more costly 
to solve than the original problem. 

The canonicalization algorithm can be modihed, however, so that fast linear transforms 
are preserved. The key is to represent all linear functions arising during the canonicalization 
process as FAO DAGs instead of as sparse matrices. The FAO DAG representation of the 
hnal cone program can be used by a matrix-free cone solver to solve the cone program. The 
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modified canonicalization algorithm never forms explicit matrix representations of linear 
functions. Hence we call the algorithm matrix-free eanoniealization. 

The remainder of this section has the following outline: In ^5.21 we give an informal 
overview of the matrix-free canonicalization algorithm. In ^5.31 we define the expression 
DAG data structure, which is used throughout the matrix-free canonicalization algorithm. 
In ^5.41 we define the data structure used to represent convex optimization problems as input 
to the algorithm. In ^5.51 we define the representation of a cone program output by the 
matrix-free canonicalization algorithm. In ^5.61 we present the matrix-free canonicalization 
algorithm itself. 

For clarity, we move some details of canonicalization to the appendix. In ^^we give a 
precise definition of the equivalence between the cone program output by the canonicalization 
algorithm and the original convex optimization problem given as input. In ^ we explain 
how the standard canonicalization algorithm generates a sparse matrix representation of a 
cone program. 

5.2 Informal overview 

In this section we give an informal overview of the matrix-free canonicalization algorithm. 
Later sections define the data structures used in the algorithm and make the procedure 
described in this section formal and explicit. 

We are given an optimization problem 

minimize fo{x) 

subject to /i(a:) < 0, i = l,...,p (8) 

hi{x) dj = 0, i = 1,... ,g, 

where x G R"" is the optimization variable, /o : R” —t R,..., /p : R"" —?• R are convex 
functions, hi : R"' —)■ R”*^ ,,hq : R"' —)■ R™"' are linear functions, and di G R”*^ ,... ,dg E 

Rm, 

are vector constants. Our goal is to convert the problem into an equivalent matrix-free 
cone program, so that we can solve it using a matrix-free cone solver. 

We assume that the problem satisfies a set of requirements known as disciplined convex 
programming |Gra04t IGBY06] . The requirements ensure that each of the /o,..., /p can be 
represented as partial minimization over a cone program. Let each function /j have the cone 
program representation 

fi{x) = minimize (over -f 

subject to -1- G j = 1,..., r^^\ 

where G R^^ ^ is the optimization variable, ..., are linear functions, Cq^ ..., 
are vector constants, and ..., are convex cones. 
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Figure 5: The expression DAG for f{x) = ||Ax ||2 + 3. 


We rewrite problem ([8]) as the equivalent cone program 
minimize (x, 

subject to — Cg^ G /C+, i = 1,... ,p, 

gf\x, tW) + ef G JCf i = j = 1,..., 

hi{x) + di e 1C^\ i = 


(9) 


We convert problem ([9]) into the standard form for a matrix-free cone program given in ([7]) 
by representing as the inner product with a vector c G concatenating the di 

and vectors into a single vector b, and representing the matrix A implicitly as the linear 

function that stacks the outputs of all the hi and g^^'^ (excluding the objective gf®) into a 
single vector. 


5.3 Expression DAGs 

The canonicalization algorithm uses a data structure called an expression DAG to represent 
functions in an optimization problem. Like the FAO DAG dehned in ^ an expression DAG 
encodes a composition of functions as a DAG where a node represents a function and an 
edge from a node n to a node v signifies that an output of u is an input to v. Figure [5] shows 
an expression DAG for the composition f[x) = |Ax ||2 + 3, where x G R” and A G 

Formally, an expression DAG is a connected DAG with one node with no outgoing edges 
(the end node) and one or more nodes with no incoming edges (start nodes). Each node in 
an expression DAG has the following attributes: 

• A symbol representing a function /. 

• The data needed to parameterize the function, such as the power p for the function 
/(x) = x^. 

• A list Ein of incoming edges. 
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• A list Eout of outgoing edges. 

Each start node in an expression DAG is either a constant function or a variable. A 
variable is a symbol that labels a node input. If two nodes u and v both have incoming 
edges from variable nodes with symbol t, then the inputs to u and v are the same. 

We say an expression DAG is affine if every non-start node represents a linear function. 
If in addition every start node is a variable, we say the expression DAG is linear. We say an 
expression DAG is constant if it contains no variables, i.e., every start node is a constant. 

5.4 Optimization problem representation 

An optimization problem representation (OPR) is a data structure that represents a convex 
optimization problem. The input to the matrix-free canonicalization algorithm is an OPR. 
An OPR can encode any mathematical optimization problem of the form 

minimize (over 1 /w.r.t. /Co) fo{x,y) 

subject to fi{x, y) efCi, / = ^ ^ 

where x G R” and y G R"* are the optimization variables, /Cq is a proper cone, /Ci,... ,/Cf 
are convex cones, and for i = 0, ...,£, we have /* : R” x R"* —)■ R"** where /C* C R’^h (For 
background on convex optimization with respect to a cone, see, e.g.^ |BV041 §4.7].) 

Problem (fT0|) is more complicated than the standard dehnition of a convex optimization 
problem given in (JH]). The additional complexity is necessary so that OPRs can encode partial 
minimization over cone programs, which can involve minimization with respect to a cone and 
constraints other than equalities and inequalities. These partial minimization problems play 
a major role in the canonicalization algorithm. Note that we can easily represent equality 
and inequality constraints using the zero and nonnegative cones. 

Goncretely, an OPR is a tuple (s, o, C) where 

• The element s is a tuple (V, /C) representing the problem’s objective sense. The element 
R is a set of symbols encoding the variables being minimized over. The element K, 
is a symbol encoding the proper cone the problem objective is being minimized with 
respect to. 

• The element o is an expression DAG representing the problem’s objective function. 

• The element G is a set representing the problem’s constraints. Each element q G G is 
a tuple (ej,/Cj) representing a constraint of the form f{x,y) G /C. The element is an 
expression DAG representing the function / and /Cj is a symbol encoding the convex 
cone /C. 

The matrix-free canonicalization algorithm can only operate on OPRs that satisfy the two 
DGP requirements |Gra041 IGBY06] . The first requirement is that each nonlinear function 
in the OPR have a known representation as partial minimization over a cone program. See 
|GB08| for many examples of such representations. 
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The second requirement is that the objective o be verifiable as convex with respect to 
the cone /C in the objective sense s by the DCP composition rule. Similarly, for each 
element (cj, /Cj) G C, the constraint that the function represented by e* lie in the convex 
cone represented by /C* must be verifiable as convex by the composition rule. The DCP 
composition rule determines the curvature of a composition f{gi{x),...,gk{x)) from the 
curvatures and ranges of the arguments gi,...,gk, the curvature of the function /, and 
the monotonicity of / on the range of its arguments. See |Gra 04] and |UMZ+14j for a full 
discussion of the DCP composition rule. Additional rules are used to determine the range 
of a composition from the range of its arguments. 

Note that it is not enough for the objective and constraints to be convex. They must 
also be structured so that the DCP composition rule can verify their convexity. Otherwise 
the cone program output by the matrix-free canonicalization algorithm is not guaranteed to 
be equivalent to the original problem. 

To simplify the exposition of the canonicalization algorithm, we will also require that the 
objective sense s represent minimization over all the variables in the problem with respect 
to the nonnegative cone, i.e., the standard definition of minimization. The most general 
implementation of canonicalization would also accept OPRs that can be transformed into 
an equivalent OPR with an objective sense that meets this requirement. 

5.5 Cone program representation 

The matrix-free canonicalization algorithm outputs a tuple (carr, <^arr, ^arr, G, A^iist) where 

• The element Carr is a length n array representing a vector c 6 R". 

• The element darr is a length one array representing a scalar d G R. 

• The element 6arr is a length m array representing a vector h G R™. 

• The element G is an FAO DAG representing a linear function f{x) = Ax, where 

A G R™^”. 

• The element /Cust is a list of symbols representing the convex cones (/Ci,..., /C^) . 

The tuple represents the matrix-free cone program 

minimize c'^x + d . . 

subject to Ax -1- 6 G /C, ^ 

where /C = /Ci x • ■ ■ x JCi. 

We can use the FAO DAG G and algorithm [7] to represent A as an FAO, i.e., export 
methods for multiplying by A and A^. These two methods are all a matrix-free cone solver 
needs to efficiently solve problem flTTD . 
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Figure 6: The expression DAG for f{x) = x + 2. 


5.6 Algorithm 

The matrix-free canonicalization algorithm can be broken down into snbrontines. We de¬ 
scribe these snbrontines before presenting the overall algorithm. 

Conic-Form. The Conic-Form snbrontine takes an OPR as input and returns an equivalent 
OPR where every non-start node in the objective and constraint expression DAGs represents 
a linear function. The output of the Conic-Form subroutine represents a cone program, but 
the output must still be transformed into a data structure that a cone solver can use, e.g. the 
cone program representation described in ^5.51 

The general idea of the Conic-Form algorithm is to replace each nonlinear function in 
the OPR with an OPR representing partial minimization over a cone program. Recall 
that the canonicalization algorithm requires that all nonlinear functions in the problem be 
representable as partial minimization over a cone program. The OPR for each nonlinear 
function is spliced into the full OPR. We refer the reader to |GB08] and |UMZ'*~14] for a full 
discussion of the Conic-Form algorithm. 

The Conic-Form subroutine preserves fast linear transforms in the problem. All linear 
functions in the original OPR are present in the OPR output by Conic-Form. The only 
linear functions added are ones like sum and scalar multiplication that are very efficient to 
evaluate. Thus, evaluating the FAO DAG representing the final cone program will be as 
efficient as evaluating all the linear functions in the original problem. 

Linear and Constant. The Linear and Constant subroutines take an affine expression 
DAG as input and return the DAG’s linear and constant components, respectively. Gon- 
cretely, the Linear subroutine returns a copy of the input DAG where every constant start 
node is replaced with a variable start node and a node mapping the variable output to a vec¬ 
tor (or matrix) of zeros with the same dimensions as the constant. The Constant subroutine 
returns a copy of the input DAG where every variable start node is replaced with a zero¬ 
valued constant node of the same dimensions. Figures [7] and |8] show the results of applying 
the Linear and Constant subroutines to an expression DAG representing f{x) = x + 2, as 
depicted in figure O 

Evaluate. The Evaluate subroutine takes a constant expression DAG as input and returns 
an array. The array contains the value of the function represented by the expression DAG. 
If the DAG evaluates to a matrix A G the array represents vec(A). Similarly, if the 
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Figure 7: The Linear subroutine applied to the expression DAG in hgurelH 



Figure 8: The Constant subroutine applied to the expression DAG in figure [6l 

DAG evaluates to multiple output vectors {bi,... ,bk) G x • • • xthe array represents 
vstack(6i, ... ,bk). For example, the output of the Evaluate subroutine on the expression 
DAG in figure [H] is a length one array with first entry equal to 2. 

Graph-Repr. The Graph-Repr subroutine takes a list of linear expression DAGs, (ei,..., e^), 
and an ordering over the variables in the expression DAGs, <y, as input and outputs an 
FAO DAG G. We require that the end node of each expression DAG represent a function 
with a single vector as output. 

We construct the FAO DAG G in three steps. In the first step, we combine the expression 
DAGs into a single expression DAG RF) iiy creating a vstack node and adding an edge from 
the end node of each expression DAG to the new node. The expression DAG RF) jg shown 
in figure O 

In the second step, we transform into an expression DAG R^^^ with a single start 
node. Let xi,... ,Xk be the variables in (ci,..., e^) ordered by <y. Let n* be the length 
of Xi if the variable is a vector and of vec(xj) if the variable is a matrix, for i = 1,... ,k. 
We create a start node representing the function split : R"' —)■ R”^ x • • • x R”*" . For each 
variable Xj, we add an edge from output i of the start node to a copy node and edges from 



Figure 9: The expression DAG for vstack(ei,..., e^). 
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Figure 10: The expression DAG when £ = 1 and ei represents f{x,y) = 
X + A{x + y). 


that copy node to all the nodes representing Xj. If Xi is a vector, we replace all the nodes 
representing x* with nodes representing the identity function. If Xj is a matrix, we replace 
all the nodes representing Xj with mat nodes. The transformation from to when 
1=1 and Cl represents /(x) = x + A{x + ?/), where x,y E R"" and A G are depicted 

in figures HU] and [TTl 

In the third and final step, we transform from an expression DAG into an FAO DAG 
G. R(2) is almost an FAO DAG, since each node represents a linear function and the DAG 
has a single start and end node. To obtain G we simply add the node and edge attributes 
needed in an FAO DAG. For each node u in representing the function /, we add to u 
an FAO (/, $/,$/*) and the data needed to evaluate <F/ and <F/*. The node already has the 
required lists of incoming and outgoing edges. We also add an array to each of R’^^^’s edges. 

Optimize-Graph. The Optimize-Graph subroutine takes an FAO DAG G as input and 
outputs an equivalent FAO DAG , meaning that the output of algorithm [7] is the same 
for G and We choose by optimizing G so that the runtime of algorithm [7] is as 

short as possible (see ^3.4p . We also compress the FAO data and edge arrays to reduce the 
graph’s memory footprint (see §33]). We could optimize the graph for the adjoint, G*, as 
well, but asymptotically at least the flop count and memory footprint for G* will be the 
same as for G, meaning optimizing G is the same as jointly optimizing G and G*. 

Matrix-Repr. The Matrix-Repr subroutine takes a list of linear expression DAGs, (ei,..., e^), 
and an ordering over the variables in the expression DAGs, <y, as input and outputs a sparse 
matrix. Note that the input types are the same as in the Graph-Repr subroutine. In fact, 
for a given input the sparse matrix output by Matrix-Repr represents the same linear func- 
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Figure 11: The expression DAG obtained by transforming in figure fTOl 

tion as the FAO DAG output by Graph-Repr. The Matrix-Repr subroutine is used by the 
standard canonicalization algorithm to produce a sparse matrix representation of a cone 
program. The implementation of Matrix-Repr is described in 

Overall algorithm. With all the subroutines in place, the matrix-free canonicalization 
algorithm is straightforward. The implementation is given in algorithm [HI 

6 Numerical results 

6.1 Implementation 

We have implemented the matrix-free canonicalization algorithm as an extension of CVXPY 
Idi, available at 

https://github.com/SteveDiamond/cvxpy, 

To solve the resulting matrix-free cone programs, we implemented modified versions of SCS 
|OCPB15] and POGS |FB15] that are truly matrix-free, available at 

https://github.com/SteveDiamond/scs, 
https://github.com/SteveDiamond/pogs, 
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Algorithm 8 Matrix-free canonicalization. 

Input: p is an OPR that satisfies the requirements of DCP. 

(s, o, O) ^ Conic-Form(p). 

Choose any ordering <y on the variables in (s, o, C). 

Choose any ordering <c on the constraints in C. 

((ei, /Cl),..., (e^, ICi)) ■(— the constraints in C ordered according to <c- 
Cmat ^ Matrix-Repr((Linear(o)), <y). 

Convert Cmat from a 1-by-n sparse matrix into a length n array Carr- 
darr ^ Evaluate (Constant (o)). 

^arr vstack(Evaluate(Constant (ci)),..., Evaluate (Constant (e^))). 

G Graph-Repr((Linear(ei),..., Linear(e£)), <y). 

^opt ^ Optimize-Graph(G) 

/Ciist ^ (/Cl,..., /C^). 

return (Cam dam ^arrj O ^ )/Ciist)- 


(The details of these modifications will be described in future work.) Our implementations 
are still preliminary and can be improved in many ways. We also emphasize that the canon- 
icalization is independent of the particular matrix-free cone solver used. 

In this section we benchmark our implementation of matrix-free canonicalization and of 
matrix-free SCS and POGS on several convex optimization problems involving fast linear 
transforms. We compare the performance of our matrix-free convex optimization modeling 
system with that of the current CVXPY modeling system, which represents the matrix A in a 
cone program as a sparse matrix and uses standard cone solvers. The standard cone solvers 
and matrix-free SCS were run serially on a single Intel Xeon processor, while matrix-free 
POGS was run on a Titan X GPU. 

6.2 Nonnegative deconvolution 

We applied our matrix-free convex optimization modeling system to the nonnegative de- 
convolution problem ([T]). The Python code below constructs and solves problem ([T]). The 
constants c and b and problem size n are defined elsewhere. The code is only a few lines, 
and it could be easily modified to add regularization on x or apply a different cost function 
to c*x — h. The modeling system would automatically adapt to solve the modified problem. 

# Construct the optimization problem. 

X = Variable(n) 

cost = sum_squares(conv(c, x) - b) 
prob = Problem(Minimize(cost), 

[x >= 0]) 

# Solve using matrix-free SCS. 
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Figure 12: Results for a problem instance with n = 1000. 


prob.solve(solver=MAT_FREE_SCS) 

Problem instances. We used the following procedure to generate interesting (nontrivial) 
instances of problem ([1]). For all instances the vector c G R"' was a Gaussian kernel with 
standard deviation n/10. All entries of c less than 10“® were set to 10“®, so that no entries 
were too close to zero. The vector b G was generated by picking a solution x with 

5 entries randomly chosen to be nonzero. The values of the nonzero entries were chosen 
uniformly at random from the interval [0,n/10]. We set b = c * x + n, where the entries 
of the noise vector v G were drawn from a normal distribution with mean zero and 

variance ||c * x||^/(400(2n — 1)). Our choice of v yielded a signal-to-noise ratio near 20. 

While not relevant to solving the optimization problem, the solution of the nonnegative 
deconvolution problem often, but not always, (approximately) recovers the original vector 
X. Figure fT^ shows the solution recovered by EGOS |DGB13) for a problem instance with 
n = 1000. The EGOS solution x* had a cluster of 3-5 adjacent nonzero entries around each 
spike in x. The sum of the entries was close to the value of the spike. The recovered x 
in hgure [12] shows only the largest entry in each cluster, with value set to the sum of the 
cluster’s entries. 

Results. Figure [13] compares the performance on problem ([1]) of the interior-point solver 
EGOS |DGB13] and matrix-free versions of SGS and FOGS as the size n of the optimization 
variable increases. We limited the solvers to 10"^ seconds. 
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Figure 13: Solve time in seconds T versus variable size n. 


For each variable size n we generated ten different problem instances and recorded the 
average solve time for each solver. EGOS and matrix-free SCS were run with an absolute 
and relative tolerance of 10“^ for the duality gap, ^2 norm of the primal residual, and ^2 
norm of the dual residual. Matrix-free POGS was run with an absolute tolerance of 10“"^ 
and a relative tolerance of 10“^. 

The slopes of the lines show how the solvers scale. The least-squares linear fit for the 
EGOS solve times has slope 3.1, which indicates that the solve time scales like n^, as expected. 
The least-squares linear fit for the matrix-free SGS solve times has slope 1.3, which indicates 
that the solve time scales like the expected n log n. The least-squares linear fit for the matrix- 
free POGS solve times in the range n G [10®, 10^] has slope 1.1, which indicates that the solve 
time scales like the expected nlogn. For n < 10®, the GPU overhead (launching kernels, 
synchronization, etc.) dominates, and the solve time is nearly constant. 

6.3 Sylvester LP 

We applied our matrix-free convex optimization modeling system to Sylvester LPs, or convex 
optimization problems of the form 


minimize Tr(Zl^X) 

subject to AXB < C (12) 

X > 0, 
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Figure 14: Solve time in seconds T versus variable size n. 


where X G is the optimization variable, and A E B E C E and 

D E R^^"^ are problem data. The inequality AXB < C is a variant of the Sylvester equation 
AXB = C |(;LAM92) . 

Existing convex optimization modeling systems will convert problem (IT^ into the vec¬ 
torized format 

minimize vec(Zl)^ vec(X) 

subject to (B^ 0 A) vec(X) < vec(C) (13) 

vec(X) > 0, 

where 0 A G is the Kronecker product of B^ and A. Let p = kq for some fixed 

k, and let n = kq^ denote the size of the optimization variable. A standard interior-point 
solver will take 0{rA) flops and O(n^) bytes of memory to solve problem ([13]). A specialized 
matrix-free solver that exploits the matrix product AXB, by contrast, can solve problem 
(IT^ in flops using 0{n) bytes of memory |VB95] . 

Problem instances. We used the following procedure to generate interesting (nontrivial) 
instances of problem (!T^ . We fixed p = 5q and generated A and B by first drawing entries 
i.i.d. from the folded standard normal distribution (he., the absolute value of the standard 
normal distribution). We then added 10“® to all entries of A and B so they were guaranteed 
to be positive. We generated D by drawing entries i.i.d. from a standard normal distribution. 
We fixed C = 11^. Our method of generating the problem data ensured the problem was 
feasible and bounded. 
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Results. Figured!] compares the performance on problem (IT2l) of the interior-point solver 
EGOS |DCB13] and matrix-free versions of SCS and POGS as the size n = of the 
optimization variable increases. We limited the solvers to 10^ seconds. For each variable 
size n we generated ten different problem instances and recorded the average solve time for 
each solver. EGOS and matrix-free SGS were run with an absolute and relative tolerance of 
10“^ for the duality gap, £2 norm of the primal residual, and ^2 norm of the dual residual. 
Matrix-free POGS was run with an absolute tolerance of 10“^ and a relative tolerance of 
10“^. The slope of the lines show how the solvers scale. The least-squares linear £t for 
the EGOS solve times has slope 3.3, which indicates that the solve time scales like n^, as 
expected. The least-squares linear £t for the matrix-free SGS solve times has slope 1.7, which 
indicates that the solve time scales like the expected The least-squares linear £t for the 
matrix-free POGS solve times in the range n G [10^,5 x 10®] has slope 1.6, which indicates 
that the solve time scales like the expected n}'^. For n < 10®, the GPU overhead (launching 
kernels, synchronization, etc.) dominates, and the solve time is nearly constant. 
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A Equivalence of the cone program 

In this section we explain the precise sense in which the cone program output by the matrix- 
free canonicalization algorithm is equivalent to the original convex optimization problem. 

Theorem A.l. Let p be a convex optimization problem whose OPR is a valid input to the 
matrix-free canonicalization algorithm. Let <h(p) be the cone program represented by the 
output of the algorithm given p’s OPR as input. All the variables in p are present in <h(p), 
along with new variables introduced during the canonicalization process |Gra nl lUMZ+14) . 
Let X G R"' represent the variables in p stacked into a vector and t G R™' represent the new 
variables in <h(p) stacked into a vector. 

The problems p and ‘h(p) are equivalent in the following sense: 

1. For all X feasible in p, there exists t* such that {x,t*) is feasible in <F(p) and p{x) = 
<h(p)(a;, R). 

2. For all {x,t) feasible in <F(p), x is feasible in p and p{x) < <F(p)(a;,f). 

For a point x feasible in p, by p{x) we mean the value of p’s objective evaluated at x. The 
notation ^{p){x,t) is similarly defined. 

Proof. See |Gran4] . ■ 

Theorem lA.ll implies that p and ‘F(p) have the same optimal value. Moreover, p is 
infeasible if and only if <F(p) is infeasible, and p is unbounded if and only if <F(p) is unbounded. 
The theorem also implies that any solution x* to p is part of a solution {x*,t*) to <F(p) and 
vice versa. 

A similar equivalence holds between the Lagrange duals of p and ‘F(p), but the details are 
beyond the scope of this paper. See |Gra04] for a discussion of the dual of the cone program 
output by the canonicalization algorithm. 

B Sparse matrix representation 

In this section we explain the Matrix-Repr subroutine used in the standard canonicalization 
algorithm to obtain a sparse matrix representation of a cone program. Recall that the 
subroutine takes a list of linear expression DAGs, {ei,... ,6^), and an ordering over the 
variables in the expression DAGs, <y, as input and outputs a sparse matrix A. 

The algorithm to carry out the subroutine is not discussed anywhere in the literature, 
so we present here the version used by GVXPY ra. The algorithm first converts each 
expression DAG into a map from variables to sparse matrices, representing a sum of terms. 
For example, if the map 0 maps the variable x G R" to the sparse matrix coefficient B G 
R™'^"' and the variable y G R" to the sparse matrix coefficient C G then 0 represents 

the sum Bx + Cy. 

The conversion from expression DAG to map of variables to sparse matrices is done 
using algorithm m The algorithm uses the subroutine Matrix-Coef f, which takes a node 
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representing a linear function / and indices i and j as inputs and outputs a sparse matrix 
D. Let / be a function defined on the range of /’s ith input such that f{x) is equal to /’s 
jth output when / is evaluated on ith input x and zero-valued matrices (of the appropriate 
dimensions) for all other inputs. The output of Matrix-Coeff is the sparse matrix D such 
that for any value x in the domain of /, 

Dvec{x) = vec{f{x)). 

The sparse matrix coefficients in the maps of variables to sparse matrices are assembled 
into a single sparse matrix A, as follows: Let Xi,... ,Xk be the variables in the expression 
DAGs, ordered according to <v'. Let be the length of Xi if the variable is a vector and 
of vec(a;j) if the variable is a matrix, for i = 1,... ,k. Let rrij be the length of expression 
DAG e/s output, for j = 1,... ,i. The coefficients for xi are placed in the hrst ni columns 
in A, the coefficients for X 2 in the next n 2 columns, etc. Similarly, the coefficients from ei 
are placed in the hrst mi rows of A, the coefficients from 62 in the next m 2 rows, etc. 
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Algorithm 9 Convert an expression DAG into a map from variables to sparse matrices. 
Input: e is a linear expression DAG that ontputs a single vector. 

Greate an empty queue Q for nodes that are ready to evaluate. 

Greate an empty set S for nodes that have been evaluated. 

Greate a map M from (node, output index) tuples to maps of variables to sparse matrices, 
for every start node m in e do 

X the variable represented by node u. 

n the length of x if the variable is a vector and of vec(a;) if the variable is a matrix. 
M[{u, 1)] a map with key x and value the n-by-n identity matrix. 

Add u to S. 

Add all nodes in e to Q whose only incoming edges are from start nodes, 
while Q is not empty do 

u ■(— pop the front node of Q. 

Add u to S. 

for edge (n,p) in u’s i?out, with index j do 

Greate an empty map Mj from variables to sparse matrices, 
for edge (n, u) in n’s i?in, with index i do 
Matrix-Coeff (M,i, j). 
k the index of (n, u) in n’s A'out- 
for key x and value C in M[(n, k)] do 
if Mj has an entry for x then 
Mj[x\ ^ Mj[x\ + 
else 

Mj[x\ ^ 

if for all edges (g,p) in p’s i?in, g is in S' then 
Add p to the end of Q. 

Wend the end node of e. 
return M[(Mend, !)]• 
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